@dariocazzani
Contributor

What does this PR do?

Adds a self-contained example showing the minimum ScratchGPT usage needed to train a transformer on Darwin's "On the Origin of Species".
The script downloads the text, trains a small model with CharTokenizer, demonstrates text generation, then automatically deletes all temporary files.

# Core usage pattern shown:
tokenizer = CharTokenizer(text=text)
model = TransformerLanguageModel(config)
trainer = Trainer(model, config.training, optimizer, experiment_path, device)
trainer.train(data=data_source, tokenizer=tokenizer)
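For readers unfamiliar with character-level tokenization, here is a minimal sketch of the idea behind a tokenizer built from raw text. `SimpleCharTokenizer` and its `encode`/`decode` methods are hypothetical names chosen for illustration; this is not ScratchGPT's `CharTokenizer`, whose actual interface may differ.

```python
class SimpleCharTokenizer:
    """Hypothetical stand-in for illustration; not ScratchGPT's CharTokenizer."""

    def __init__(self, text: str) -> None:
        # The vocabulary is every distinct character in the text, sorted so
        # that token ids are deterministic across runs.
        self.vocab = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(self.vocab)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    def encode(self, text: str) -> list[int]:
        # Map each character to its integer id (KeyError on unseen characters).
        return [self.stoi[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        # Map integer ids back to characters and join them into a string.
        return "".join(self.itos[i] for i in ids)
```

With this sketch, `SimpleCharTokenizer("hello")` has a four-character vocabulary, and decoding an encoded string returns the original string.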

@dariocazzani force-pushed the feature/examples_simple branch from a405c0d to 94a65a5 on September 13, 2025 13:50
Creates examples/simple.py demonstrating core ScratchGPT usage:
auto-downloads text data, trains small model with CharTokenizer,
shows text generation, uses temp dirs for clean execution
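The "temp dirs for clean execution" idea can be sketched with the standard library alone. This is an assumption about the general pattern, not the code in examples/simple.py; `run_in_temp_dir` and `work` are hypothetical names.

```python
import tempfile
from pathlib import Path


def run_in_temp_dir(work):
    """Run work(tmp_path) inside a temporary directory that is removed afterwards.

    Returns the result of work and the (now deleted) directory path.
    """
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        # Everything written under tmp_path (downloads, checkpoints, logs)
        # disappears when the with-block exits, even if work() raises.
        result = work(tmp_path)
    return result, tmp_path
```

A script structured this way leaves nothing behind: the downloaded corpus and any experiment output live under the temporary path and are deleted on exit.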
@dariocazzani force-pushed the feature/examples_simple branch from 94a65a5 to 8360915 on September 13, 2025 13:53
- Remove subprocess dependency, use Python's built-in urlretrieve() instead
- Eliminate curl requirement that could fail on some systems
- Always download fresh data (removed file existence check)
- Wrap trainer.train() in try/except to handle KeyboardInterrupt
- Allow users to stop training and proceed to text generation demo
- Add clear instruction about Ctrl-C functionality for better UX
- Ensure text generation always runs regardless of training completion
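The changes listed above can be sketched roughly as follows. This is a hedged illustration of the pattern rather than the merged code: `download_text` and `train_then_generate` are hypothetical helpers, and `train` and `generate` stand in for whatever callables wrap `trainer.train(...)` and the generation demo.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_text(url: str, dest) -> str:
    """Fetch url to dest with the standard library (no curl, no subprocess)."""
    # No existence check: the file is downloaded fresh on every call.
    urlretrieve(url, str(dest))
    return Path(dest).read_text(encoding="utf-8")


def train_then_generate(train, generate):
    """Run train(); if the user presses Ctrl-C, stop training and still generate."""
    try:
        train()
        completed = True
    except KeyboardInterrupt:
        # Ctrl-C ends training early instead of terminating the script.
        completed = False
    # Generation runs whether or not training finished.
    return completed, generate()
```

Catching only `KeyboardInterrupt` keeps genuine training errors visible while still letting a user cut a long run short and see generated text.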
@dariocazzani force-pushed the feature/examples_simple branch from 5110055 to a7fe81e on September 13, 2025 14:30
@ayeganov ayeganov merged commit 8b79deb into main Sep 13, 2025
3 checks passed
@ayeganov ayeganov deleted the feature/examples_simple branch September 13, 2025 14:40